Openmp #439
Conversation
Is there a benchmark showing how much speedup it brings compared to a multi-threaded BLAS implementation? For many layers (e.g. convolution), having a multithreaded BLAS is already fast, and explicit OpenMP parallelization makes the code more complex with little improvement.
I have already tried the openmp version on …
I do convolutional Forward() and Backward() in parallel for different images in the batch. As a result the CPU code is much faster now: on a desktop I saw a 3.5x speed-up, and on a server with 16 cores ~10x. Still, it is not as fast as GPU. The changes I made in im2col and col2im were not related to OpenMP: I removed the division by modifying the for (c...) loop, and added an early exit for boundary cases.
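A hedged illustration of the "removed division" tweak mentioned above (the function names are hypothetical, not Caffe's actual im2col code): instead of decomposing a flat index with division/modulo inside the hot loop, iterate the channel and kernel components directly. Both functions produce the same (channel, kh, kw) triples; only the index arithmetic differs.

```cpp
#include <cassert>
#include <tuple>
#include <vector>

// Baseline: recover (channel, kh, kw) from a flat index with div/mod,
// as the pre-patch loop structure effectively does.
std::vector<std::tuple<int, int, int>> decompose_divmod(int channels, int k) {
  std::vector<std::tuple<int, int, int>> out;
  for (int c = 0; c < channels * k * k; ++c)
    out.emplace_back(c / (k * k), (c / k) % k, c % k);
  return out;
}

// Division-free variant: iterate the three components as nested loops.
std::vector<std::tuple<int, int, int>> decompose_nested(int channels, int k) {
  std::vector<std::tuple<int, int, int>> out;
  for (int ch = 0; ch < channels; ++ch)
    for (int kh = 0; kh < k; ++kh)
      for (int kw = 0; kw < k; ++kw)
        out.emplace_back(ch, kh, kw);
  return out;
}
```

Removing per-iteration integer division from an inner loop of this size is a classic micro-optimization; it is independent of the OpenMP change, as the comment notes.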
On benchmarks: I used cifar10 and imagenet training as benchmarks. I compared OpenMP with the current dev version (CPU). I used 2 machines for testing: a desktop (Ivy Bridge, 1 socket x 4 cores, no HT).
I saw your code. You do parallel on …
Forward() can be done independently for each image n, but I had to replicate the buffer for im2col to avoid collisions between threads. For Backward() I replicated the buffer for weight_diff, and summed them all after all threads finish their jobs.
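A minimal sketch of the buffer-replication scheme described above (function and variable names like `accumulate_weight_diff` and `weight_diff_mt` are illustrative, not Caffe's actual code): each thread accumulates gradients for its images into a private slice of a replicated weight_diff buffer, and the slices are summed once the parallel region ends.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallback so the sketch also builds without -fopenmp.
static int omp_get_max_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif

// Each thread writes only to its own slice of weight_diff_mt, so parallel
// images never race; the per-thread slices are reduced afterwards.
std::vector<float> accumulate_weight_diff(const std::vector<float>& per_image_grad,
                                          int num_images, int weight_count) {
  const int nthreads = omp_get_max_threads();
  std::vector<float> weight_diff_mt(
      static_cast<std::size_t>(nthreads) * weight_count, 0.f);
  #pragma omp parallel for
  for (int n = 0; n < num_images; ++n) {
    float* my_diff = &weight_diff_mt[
        static_cast<std::size_t>(omp_get_thread_num()) * weight_count];
    for (int w = 0; w < weight_count; ++w)
      my_diff[w] += per_image_grad[static_cast<std::size_t>(n) * weight_count + w];
  }
  // Sum the replicas into the final gradient after all threads finish.
  std::vector<float> weight_diff(weight_count, 0.f);
  for (int t = 0; t < nthreads; ++t)
    for (int w = 0; w < weight_count; ++w)
      weight_diff[w] += weight_diff_mt[static_cast<std::size_t>(t) * weight_count + w];
  return weight_diff;
}
```

The trade-off is exactly the one debated in this thread: replication avoids locks and atomics, but multiplies the buffer footprint by the thread count.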
I tested the openmp version with MKL and found some issues which I would like to investigate.
Fixed the Makefile to support OpenMP with MKL. Current benchmark results:
Fixed OpenMP + MKL. Tested on imagenet (200 train iterations). The difference between CPU (2-socket server, Xeon E5-2680) and GPU (K20) is 1.5x (CPU is slower); for cifar10, CPU is faster.
To reproduce the results with MKL, you should: 1. Get a free non-commercial version here: https://software.intel.com/en-us/non-commercial-software-development
Good job! Will take a look.
col_buffer_mt_.resize(num_of_threads_ *
    channels_ * kernel_size_ * kernel_size_ * height_out * width_out);
weight_diff_mt_.resize(num_of_threads_ *
    num_output_ * (channels_ / group_) * kernel_size_ * kernel_size_);
As has also been pointed out in the discussion, I am worried about the intermediate data size: for example, for an input of size 55*55*256 and a 3*3 kernel (I am just making up numbers; they may not correspond to actual imagenet layers) with stride-1 convolution, a single buffer will have size
55 * 55 * 3 * 3 * 256 * sizeof(float) ≈ 28MB.
With multiple threads (like 10) this will grow to around 280MB, and with multiple convolutional layers it may quickly grow to gigabytes. That's why I feel that we should rely on a multithreaded BLAS to speed things up rather than having a single-threaded BLAS and explicit OpenMP code.
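The arithmetic behind the ~28MB figure can be checked directly (a sketch; `im2col_buffer_bytes` is a hypothetical helper, and it assumes 4-byte floats):

```cpp
#include <cassert>
#include <cstddef>

// One im2col buffer for a convolution whose output is h_out x w_out,
// with `channels` input channels and a k x k kernel: every output pixel
// stores a full channels*k*k patch.
std::size_t im2col_buffer_bytes(std::size_t h_out, std::size_t w_out,
                                std::size_t channels, std::size_t k) {
  return h_out * w_out * channels * k * k * sizeof(float);
}
```

For the 55x55x256 input with a 3x3 kernel at stride 1 quoted above, this gives 27,878,400 bytes, i.e. roughly the 28MB per buffer (before any per-thread replication) that the comment estimates.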
I tried to run caffe with the large OverFeat net (http://cilvr.nyu.edu/doku.php?id=software:overfeat:start) on a K20, but the model did not fit in the K20's DRAM. I was able to train it on CPU. The footprint for 16 threads was ~16 GB(!). The training speed was ~5 images/sec. I used a server with 32 GB and 2x E5-2670 @ 2.6 GHz.
MEMORY: Total memory overhead is relatively small. I ran imagenet training on my desktop with OMP_NUM_THREADS=4 and never saw memory utilization above 5 GB. Since the default desktop configuration these days is 8 GB, this overhead does not seem to be an issue. Even on servers, when I set OMP_NUM_THREADS=24, the size was still around 5-5.5 GB.
Your benchmarks are very impressive, thanks for reporting these numbers! I'll hopefully be able to give this a try at some point, and would definitely be in favor of merging this if I see speed/memory numbers anywhere near what you report (perhaps at first into a new branch off dev, to make it more easily available while doing further testing and refinement if necessary). I'm pretty busy over the next few weeks though; maybe somebody else will beat me to it.
These are the setup details:
Test: imagenet, 100 train iterations (batch = 256).
Hi @borisgin, sorry it's taken me forever to get around to this. Have you tested at all with OpenBLAS, or only MKL? I tried this out with OpenBLAS and didn't see performance improve on a machine with 32 … Here's what I did: initially I was using the system OpenBLAS library, but I saw a bunch of error messages when I ran the training telling me to recompile OpenBLAS with the option … Using your branch, these are the results I see for imagenet training (…):
Around 26 seconds per iteration. Then I reran with …
That's ~29 seconds per iteration. I'm using an 8-core machine. Do you think I could have done something wrong with the setup, and/or do you expect this will only work well with MKL? I can retry it with MKL if you think that's the problem. (I notice that when I open "top" while running with just 1 OMP thread, I see CPU utilization often >1000%, so it seems the BLAS library is already doing quite a bit of parallelization, as @Yangqing suggested.)
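Since the >1000% CPU reading suggests two layers of parallelism fighting each other, one hedged way to disentangle them (OMP_NUM_THREADS and OPENBLAS_NUM_THREADS are the standard OpenMP/OpenBLAS variables; the core count here is illustrative) is to pin BLAS to a single thread and let the explicit OpenMP code own the cores:

```shell
# Let this branch's OpenMP parallelism use the cores, and disable
# OpenBLAS-internal threading so the two do not oversubscribe each other.
export OMP_NUM_THREADS=8        # e.g. one thread per physical core
export OPENBLAS_NUM_THREADS=1   # pin the BLAS to a single thread
# ...then run the training command as usual.
```

Reversing the split (OMP_NUM_THREADS=1 and OPENBLAS_NUM_THREADS set to the core count) measures the multithreaded-BLAS baseline that @Yangqing advocates, making the two approaches directly comparable.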
I tested borisgin's improvement; it does speed things up on CPU, but it doesn't scale as well as he reported. Our result is 3x+ for 16 cores. @jeffdonahue, are all 8 cores on your machine physical cores?
Ah, nice catch -- there are actually only 2 physical cores on the machine I was using. I had looked at /proc/cpuinfo and misinterpreted the result (even knowing that it is easily misinterpretable)... it prints 32 "processors" with "cpu cores: 8", but in fact the "physical id" is only 0 or 1, so I guess I have 2 physical cores. I will try again on a more powerful machine, sorry.
@jeffdonahue it means you have 2 physical CPUs, each CPU has 8 cores, and each core has 2 hyper-threads, so you get 32 threads (processors).
grep 'physical id' /proc/cpuinfo | sort -u
grep 'core id' /proc/cpuinfo | sort -u | wc -l
grep 'processor' /proc/cpuinfo | sort -u | wc -l
dmidecode -s processor-version
Check that the OpenBLAS compile command used OpenMP, or set the thread count with "void openblas_set_num_threads(int num_threads);". By default, with USE_OPENMP=1 it will use all the threads, as said by @xianyi.
@jeffdonahue your machine has 2 CPUs, and each CPU has 8 cores, so the number of physical cores is 16 and the number of processors (hyper-threads) is 32. Thus, you can set OMP_NUM_THREADS=16. But what is your batch size? When using OpenMP, the overhead of creating threads is not trivial; you cannot use a very small batch size.
Hi, @jeffdonahue, I did not test the OpenMP + OpenBLAS combination. I tested with SMT (hyper-threading) disabled in BIOS, since it does not help matrix multiplication at all. Did you rebuild OpenBLAS for your CPU? I have a server with 2 sockets x E5-2680; each CPU has 8 cores/8 threads at 2.7 GHz. I ran imagenet training with batch size = 256 for 100 iterations.
Hi, @borisgin,
First I used what I got with "sudo apt-get install libopenblas-base".
Hi, @borisgin, the Intel Xeon E5-2680 is the Sandy Bridge arch, not Haswell. For Sandy Bridge, OpenBLAS 0.2.9 rolled back the sgemm kernel to the old Core2 kernel. In the latest 0.2.10 version, we enabled an optimized sgemm kernel for Sandy Bridge. Could you retest the performance with the new OpenBLAS version?
I got a new desktop with a Haswell CPU, so I wanted to rebuild the code to check how the new AVX2 instructions (FMA) impact performance. I will retest the performance on it for OpenBLAS vs (OpenBLAS + OpenMP).
Hi, @xianyi,
Rebased openmp with latest dev branch:
I found a curious performance bug, related to the alternative implementations of MKL functions which are lacking in OpenBLAS. I did a quick fix in lrn_layer.cpp. @xianyi, any plans to add a fast powd to OpenBLAS? Thanks, Boris
caffe with OpenBLAS:
I1117 14:03:46.019815 17778 caffe.cpp:268] norm2 backward: 692.64 ms.
caffe with MKL:
I1119 10:58:23.674092 29606 caffe.cpp:268] norm2 backward: 47.9396 ms.
caffe with MKL and OpenMP:
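A hedged sketch of the kind of quick fix described above (the function name `powx_omp` is illustrative, not the actual lrn_layer.cpp change): when a vectorized elementwise power routine such as MKL's vsPowx is unavailable, fall back to scalar std::pow parallelized across elements with OpenMP.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Elementwise y[i] = a[i]^b. Without a vectorized BLAS/VML powx, the scalar
// loop dominates LRN backward time; splitting it across threads recovers
// much of the gap measured above.
void powx_omp(int n, const float* a, float b, float* y) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    y[i] = std::pow(a[i], b);
}
```

This is a stopgap: a proper SIMD-vectorized pow in OpenBLAS (the feature request mentioned below) would be faster still, since per-element libm calls do not vectorize.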
@borisgin, I already added a feature request for this function.
Is this PR (or something similar) going to be merged soon? When I checked not too long ago, CPU Caffe was unnecessarily slow without OpenMP. I'm debating tacking my own OpenMP things together... but merging this PR with present-day Caffe would be optimal.
I wouldn't hold my breath. This PR is more than a year old and it doesn't seem the maintainers think it should be merged. (Probably better to close it.)
While CPU execution can be further optimized, this PR is closed since it is against the deprecated dev branch. The branch was not merged at the time due to concerns about further complexity and dependencies. Thanks for your work @borisgin.
@shelhamer @bhack
Hi, I was trying to build your openmp version on a CentOS 6.5 computer and I got a protobuf version error.
Hi Crefeda,
@Crefeda If I recall correctly, caffe relies on some features that were introduced in protobuf 2.5, so I think 2.5 and above should work.
Parallel version of caffe for CPU based on OpenMP. Significant speed-up on CPU; scales well with the number of cores (3x for 4 cores, 10x for 16 cores). Modified files: convolutional, pooling and relu layers, im2col and col2im.